Introduction

Recent years have seen the development of numerous algorithms and computational packages for the analysis of multi-omics data sets. At this point, one can find multiple review articles summarizing progress in the field [@subramanian2020; @graw2021; @heo2021; @picard2021; @reel2021; @vlachavas2021; @adossa2021]. As with other applications of machine learning, the kinds of problems addressed by these algorithms are divided into two categories: unsupervised (e.g., clustering or class discovery) or supervised (including class comparison and class prediction) [@simon2003]. Advances in the area of unsupervised learning have been broader and deeper than advances on supervised learning.

One of the most effective unsupervised methods is Multi-Omic Factor Analysis (MOFA) [@argelaguet2018; @argelaguet2020]. A key property of MOFA is that it does not require all omics assays to have been performed on all samples under study. In particular, it can effectively discover class structure across omics data sets even when data for many patients have only been acquired on a subset of the omics technologies. As of this writing, we do not know of any supervised multi-omics method that can effectively learn to predict outcomes when samples have only been assayed on a subset of the omics data sets.

MOFA starts with a standard method – Latent Factor Analysis – that is known to work well on a single omics data set. It then fits a coherent model that identifies latent factors that are common to, and able to explain the data well in, all the omics data sets under study. Our investigation (unpublished) of the factors found by MOFA suggests that, at least in come cases, it is approximately equivalent to a two-step process:

  1. Use principal components analysis to identify initial latent factors in each individual omics data set.
  2. For each pair of omics data sets, use overlapping samples to train and extend models of each factor to the union of assayed samples.

That re-interpretation of MOFA suggests that an analogous procedure might work for supervised analyses as well. In this article, we describe a two-step algorithm, which we call “plasma”, to find models that can predict time-to-event outcomes on samples from multi-omics data sets even in the presence of incomplete data. We use partial least squares (PLS) for both steps, using Cox regression to learn the single omics models and linear regression to learn how to extend models from one omics data set to another. To illustrate the method, we use a subset of the esophageal cancer (ESCA) data set from The Cancer Genome Atlas (TCGA).

Methods

Our computational method is implemented and the data are available in the plasma package.

suppressWarnings( library(plasma) )
packageVersion("plasma")
## [1] '0.7.6'

Data

The results included here are in whole or part based upon data generated by the TCGA Research Network. We downloaded the entire esophageal cancer Level 3 data set [@tcga2017] from the Genomics Data Commons (GDC) [@jensen2017] on 6 August 2018. We filtered the data sets so that only the most variable, and presumably the most informative, features were retained. Here, we load this sample data set.

loadESCAdata()
sapply(assemble, dim)
##      ClinicalBin ClinicalCont MAF Meth450 miRSeq mRNASeq RPPA
## [1,]          53            6 566    1454    926    2520  192
## [2,]         185          185 184     185    166     157  126
  1. From TCGA, we obtained 162 columns of clinical, demographic, and laboratory data on 185 patients. We removed any columns that always contained the same value. We also removed any columns whose values were missing in more than 25% of the patients. We converted categorical variables into sets of binary variables using one-hot-encoding. We then separated the clinical data into three parts:
    1. Outcome (overall survival)
    2. Binary covariates (53 columns)
    3. Continuous covariates (6 columns)
  2. Exome sequencing data for 184 patients with esophageal cancer was obtained as mutation allele format (MAF) files. We removed any gene that was mutated in fewer than 3% of the samples. The resulting data set contained 566 mutated genes.
  3. Methylation data for 185 ESCA patients was obtained as beta values computed by the TCGA from Illumina Methylation 450K microarrays. We removed any CpG site for which the standard deviation of the beta values was less than 0.3. The resulting data set contained 1,454 highly variable CpG’s.
  4. Already normalized sequencing data on 2,566 microRNAs (miRs) was obtained for 185 patients. We removed any miR for which the standard deviation of normalized expression was less than 0.05, which left 926 miRs in the final data set.
  5. Already normalized sequencing data on 20,531 mRNAs was obtained in 184 patients. We removed any mRNA whose mean normalized expression was less than 6 or whose standard deviation was less than 1.2. The final data set included 2,520 mRNAs.
  6. Normalized expression data from reverse phase protein arrays (RPPA) was obtained from antibodies targeting 192 proteins in 126 patients. All data were retained for further analysis.

Finally, in order to be able to illustrate the ability of the plasma algorithm to work in the presence of missing data, we randomly selected 10% of the patients to remove from the miRSeq data set (leaving 166 patients) and 15% of the patients to remove from the mRNASeq data set (leaving 157 patients). We provid a summary of the outcome data below.

Imputation

We recommend imputing small amounts of missing data in the input data sets. The underlying issue is that the PLS models we use for individual omics data sets will not be able to make predictions on a sample if even one data point is missing. As a result, if a sample is missing at least one data point in every omics data set, then it will be impossible to use that sample at all.

For a range of available methods and R packages, consult the CRAN Task View on Missing Data. We also recommend the R-miss-tastic web site on missing data. Their simulations suggest that, for purposes of producing predictive models from omics data, the imputation method is not particularly important. Because of the latter finding, we have only implemented two simple imputation methods in the plasma package:

  1. meanModeImputer will replace any missing data by the mean value of the observed data if there are more than five distinct values; otherwise, it will replace missing data by the mode. This approach works relatively well for both continuous data and for binary or small categorical data.
  2. samplingImputer replaces missing values by sampling randomly from the observed data distribution.
set.seed(54321)
imputed <- lapply(assemble, samplingImputer)

Computational Approach

The plasma algorithm is based on Partial Least Squares (PLS), which has been shown to be an effective method for finding components that can predict clinically interesting outcomes [@bastien2015]. The workflow of the plasma algorithm is illustrated in Figure 1 in the case of three omics data sets. First, for each of the omics data sets, we apply the PLS Cox regression algorithm (plsRcox Version 1.7.6 [@bertrand2021]) to the time-to-event outcome data to learn three separate predictive models (indicated in red, green, and blue, respectively). Each of these models may be incomplete, since they are not defined for patients who have not been assayed (shown in white) using that particular omics technology. Second, for each pair of omics data sets, we apply the PLS linear regression algorithm (pls Version 2.8.1 [@mishra2022]) to learn how to predict the coefficients of the Cox regression components from one data set using features from the other data set. This step extends (shown in pastel red, green, and blue, resp.) each of the original models, in different ways, from the intersection of samples assayed on both data sets to their union. Third, we average all of the different extended models (ignoring missing data) to get a single coherent model of component coefficients across all omics data sets. Assuming that this process has been applied to learn the model from a training data set, we can evaluate the final Cox regression model on both the training set and a test set of patient samples.

Figure 1 : Workflow schematic for plasma algorithm with three omics data sets. See main text for an explanation.

Figure 1 : Workflow schematic for plasma algorithm with three omics data sets. See main text for an explanation.

All computations were performed in R version 4.2.1 (2022-06-23 ucrt) of the R Statistical Software Environment [@Rbook]. Cox proportioanl hazards models for survival analysis were fit using version 3.3.1 of e survival R package. We used additional exploratory graphical tools from version 1.3.1 of the beanplot R package [@kampstra2008] and version 1.5.1 of the Polychrome R package [@coombes2019].

Terminology

Because of the layered nature of the plasma algorithm, we intend to use the following terminology to help clarify the later discussions.

  1. The input data contains a list of omics data sets.
  2. Each omics data set contains measurements of multiple features.
  3. The first step in the algorithm uses PLS Cox regression to find a set of components. Each component is a linear combination of features. The components are used as predictors in a Cox proportional hazards model, which predicts the log hazard ratio as a linear combination of components.
  4. The second step in the algorithm creates a secondary layer of components. We do not give these components a separate name. They are not an item of particular focus; we view them as a way to extend the first level components to more samples by “re-interpreting” them in other omics data sets.

Preparing the Data

To be consistent with the MOFA2 R package [@argelaguet2020], all of the data sets are arranged so that patient samples are columns and assay features are rows. Our first task is to pad each data set with appropriate NA’s to ensure that each set includes the same patient samples in the same order, where that order matches the outcome data frame.

MO <- prepareMultiOmics(imputed, Outcome)
summary(MO)
## Datasets:
##      ClinicalBin ClinicalCont MAF Meth450 miRSeq mRNASeq RPPA
## [1,]          53            6 566    1454    926    2520  192
## [2,]         185          185 185     185    185     185  185
## Outcomes:
##    patient_id  vital_status days_to_death    days_to_last_followup      Days       
##  a3i8   :  1   alive:108    Min.   :   9.0   Min.   :   4.0        Min.   :   4.0  
##  a3ql   :  1   dead : 77    1st Qu.: 180.0   1st Qu.: 336.5        1st Qu.: 232.0  
##  a3y9   :  1                Median : 351.0   Median : 402.5        Median : 400.0  
##  a3ya   :  1                Mean   : 495.2   Mean   : 570.1        Mean   : 538.9  
##  a3yb   :  1                3rd Qu.: 650.0   3rd Qu.: 696.8        3rd Qu.: 681.0  
##  a3yc   :  1                Max.   :2532.0   Max.   :3714.0        Max.   :3714.0  
##  (Other):179                NA's   :108      NA's   :77

We see that the number of patients in each data set is now equal to the number of patients with clinical outcome data.

Split Into Training and Test

As indicated above, we want to separate the data set into training and test samples. We will use 60% for training and 40% for testing.

set.seed(54321)
splitVec <- rbinom(nrow(Outcome), 1, 0.6)

Figure 2 presents a graphical overview of the number of samples (N) and the number of features (D) in each omics component of the training and test sets.

trainD <- MO[, splitVec == 1]
testD <- MO[, splitVec == 0]
opar <- par(mai = c(1.02, 1.32, 0.82, 0.22), mfrow = c(1,2))
plot(trainD, main = "Train")
plot(testD, main = "Test")
Figure 2 : Overview of training and test data.

Figure 2 : Overview of training and test data.

par(opar)

Results

Individual PLS Cox Regression Models

The first step of the plasma algorithm is to fit PLS Cox models on each omics data set using the function fitCoxModels. The returned object of class MultiplePLSCoxModels contains a list of SingleModel objects, one for each assay, and within each there are three regression models:

  • The plsRcoxmodel contains the coefficients of the components learned by PLS Cox regression. The number of components is determined automatically as a function of the logarithm of the number of features in the omics data set. The output of this model is a continuous prediction of “risk” for the time-to-event outcome of interest.
  • Two separate models are constructed using the prediction of risk on the training data.
    • The riskModel is a coxph model using continuous predicted risk as a single predictor.
    • The splitModel is a coxph model using a binary split of the risk (at the median) as the predictor.
suppressWarnings( firstPass <- fitCoxModels(trainD, timevar = "Days",
                          eventvar = "vital_status", eventvalue = "dead") )
summary(firstPass)
## An object containing MultiplePLSCoxModels based on:
## [1] "ClinicalBin"  "ClinicalCont" "MAF"          "Meth450"      "miRSeq"       "mRNASeq"     
## [7] "RPPA"
if (!interactive()) {
  plot(firstPass, legloc = "bottomleft") # margins too small inside RStudio window
}
Figure 3 : Kaplan-Meier plots of overall survival on the training set from separate PLS Cox omics models

Figure 3 : Kaplan-Meier plots of overall survival on the training set from separate PLS Cox omics models

On the training set, each of the seven contributing omics data sets is able to find a PLS model that can successfully separate high risk from low risk patients (Figure 3).

Extend Model Components Across Omics Data Sets

The second step of the algorithm is to extend the individual omics-based models across other omics data sets. This step is performed using the plasma function, which takes in the previously created objects of class multiOmics and MultiplePLSCoxModels. The function operates iteratively, so in our case there are seven different sets of predictions of the PLS components. These different predictions are averaged and saved internally as a data frame called meanPredictions. The structure of models created and stored in the plasma object is the same as for the separate, individual, omics models. Figure 4 shows the Kaplan-Meier plot using the predicted risk, split at the median value, on the training data set.

pl <- plasma(MO, firstPass)
plot(pl, legloc = "topright", main = "Training Data", xlab = "Time (Days)")
Figure 4 : Kaplan-Meier plot of overall survival on the training set using the unified `plasma` Cox model.

Figure 4 : Kaplan-Meier plot of overall survival on the training set using the unified plasma Cox model.

Independent Test Set

Now we want to see how well the final composite model generalizes to our test set. Figure 5 uses the predicted risk, split at the median of the training data, to construct a Kaplan-Meier plot on the test data. The model yields a statistically significant (p = 0.0063) separation of outcomes between the high and low risk patients.

testpred <- predict(pl, testD)
plot(testpred, main="Testing Data", xlab = "Time (Days)")
Figure 5 : Kaplan-Meier plot of overall survival on the test set.

Figure 5 : Kaplan-Meier plot of overall survival on the test set.

Interpretation

At this point, our model appears to be a fairly complex black box. We have constructed a matrix of components, based on linear combinations of actual features in different omics data sets. These components can then be combined in yet another linear model that predicts the time-to-event outcome through Cox regression. In this section, we want to explore how the individual features from different omics data sets contribute to different model components.

Our first act toward opening the black box is to realize that not all of the components discovered from the individual omics data sets survived into the final composite model. Some components were eliminated because they appeared to be nearly linearly related to components found in other omics data sets. So, we can examine the final composite model more closely.

pl@fullModel
## Call:
## coxph(formula = Surv(Days, vital_status == "dead") ~ ClinicalBin1 + 
##     MAF1 + Meth4501 + Meth4502 + Meth4503 + Meth4504 + mRNASeq1 + 
##     RPPA2 + RPPA3, data = riskDF)
## 
##                 coef exp(coef) se(coef)      z        p
## ClinicalBin1  0.7458    2.1082   0.4767  1.564 0.117715
## MAF1          2.2183    9.1914   0.8789  2.524 0.011602
## Meth4501     -0.1429    0.8668   0.0712 -2.007 0.044727
## Meth4502      1.6820    5.3761   0.8465  1.987 0.046930
## Meth4503      1.1703    3.2228   0.3373  3.469 0.000522
## Meth4504     -1.8067    0.1642   0.7077 -2.553 0.010683
## mRNASeq1      0.2624    1.3001   0.1003  2.615 0.008911
## RPPA2         0.9545    2.5975   0.2887  3.306 0.000947
## RPPA3         0.8029    2.2321   0.3239  2.479 0.013169
## 
## Likelihood ratio test=60.41  on 9 df, p=1.12e-09
## n= 185, number of events= 77
temp <- terms(pl@fullModel)
mainterms <- attr(temp, "term.labels")
rm(temp)
mainterms
## [1] "ClinicalBin1" "MAF1"         "Meth4501"     "Meth4502"     "Meth4503"     "Meth4504"    
## [7] "mRNASeq1"     "RPPA2"        "RPPA3"

We see that at least one component discovered from four of the five “true” omics data sets survived in the final model; only the miR components failed to make the cut. In addition, one component from the binary clinical data ws retained in the final model.

Our interest now turns to understanding how the features from individual omics data sets contribute to the components that are used in the final model. As mentioned earlier, these contributions are mediated through two levels of linear regression models when extending a model from data set A to data set B. A linear combination of features from set B is used to define the secondary level of components; then a linear combination of these components is used to predict the components of the single Cox model built that had been from set A. These weights can be combined and extracted using the getAllWeights function, and can then be explored.

Clinical Binary Data

We use the binary clinical data set to begin illustrating one method for interpreting the components.

library(oompaBase)
HG <-  blueyellow(64)
cbin <- getAllWeights(pl, "ClinicalBin")
compcolors <- c("forestgreen", "purple")[1 + 1*(colnames(cbin@contrib) %in% mainterms)]
heat(cbin, cexCol = 0.9, cexRow = 0.5, col = HG, ColSideColors = compcolors)
Figure 6 : Unscaled heatmap of the contributions of binary clinical features to all components (purple = retained in final model, green = not retained).

Figure 6 : Unscaled heatmap of the contributions of binary clinical features to all components (purple = retained in final model, green = not retained).

Figure 6 shows the raw weights for each clinical binary feature in all of the original omics components. We would like to simplify this plot in several ways. First, we can remove any components that were not retained in the final model (indicated by the green color bar in the top dendrogram). Second, we hypothesize that some components intrinsically have a wider spread of weights, and that it might be more important to scale the components consistently to look at the relative contributions. Finally, we can remove any features that seem to make no contributions to any of the components; that is; those that do not have highly ranked weights (by absolute value) in any component.

shrink <- function(dset, N) {
  dset@contrib <- scale(dset@contrib)               # standardize
  feat <- unique(unlist(as.list(getTop(dset, N))))  # remove useless features
  dset@contrib <- dset@contrib[feat, mainterms]     # remove unused components
  dset
}
xbin <- shrink(cbin, 4)
heat(xbin, cexCol = 0.9, cexRow = 0.9, col = HG)
Figure 7 : Scaled heatmap of the contributions of filtered binary clinical features to important components.

Figure 7 : Scaled heatmap of the contributions of filtered binary clinical features to important components.

In Figure 7, we can identify strong contrasts between several pairs of variables. For example, one set of components is enriched with white, never smokers, who still have evidence of tumors, at stage T3 and grade 3 in the lower third of the esophagus (ICD-10 code C15.5), while another group is enriched for Asian, current smokers, who are tumor-free, with stage N0, T2 tumors from the lower third of the esophagus (ICD-10 code C15.4).

mRNA-Sequencing Data

We can apply the same method to visualize contributors from each of the omics data sets. As a second illustration, we look at the standardized weights from the mRNA data set in the components that are part of the final model, keeping only those features that are highly ranked by absolute weight in at least one component (Figure 8).

mrna <- getAllWeights(pl, "mRNASeq")
xmrna <- shrink(mrna, 7)
tmp <- rownames(xmrna@contrib)
rownames(xmrna@contrib) <- sapply(strsplit(tmp, "\\."), function(x) x[1])
heat(xmrna, cexCol = 0.9, cexRow = 0.6, col = HG)
Figure 8 : Scaled heatmap of the contributions of filtered mRNA features to important components

Figure 8 : Scaled heatmap of the contributions of filtered mRNA features to important components

Uniting the Contributors

One difficulty with the heatmaps in the previous section is that they are focused on individual input data sets, and not on individual components. In order to fully understand which features contribute, for example, to the first mutation component (MAF1), one would have to scan all the heatmaps from all the datasets and then try to combine the influences. In order to help with that procedure, we can merge all the contributions into a single data frame, with an accompanying factor tracking the source data set.

CW <- combineAllWeights(pl)
contra <- CW@combined
datasrc <- CW@dataSource

Figure 9 displays the mean, standard deviation (SD), median, and median absolute deviation (MAD) of the weight-contributions for each data set in each component.

image(CW)
Figure 9 : Summary statistics of weights by component and dataset.

Figure 9 : Summary statistics of weights by component and dataset.

Standardized Weights

To remain consistent with the previous heatmaps, we have standardized the weights in each data set and component. In Figure 10, we create beanplots showing the distributions of weights arising from each data set in each (retained) component. (Similar plots for the raw weights are available in Supplementary Data.) Some data sets have very different contribution patterns than others. For example, the miRSeq data set appears to have significant outliers making large contributions in almost every component, the MAF and RPPA data sets also frequently (but not always) include such outliers.

library(beanplot)
library(Polychrome)
data(palette36)
foo <- computeDistances(palette36[3:36])
colist <- as.list(palette36[names(foo)[1:7]])
brute <- stdize(CW, "standard")
opar <- par(mfrow = c(3,3), mai = c(0.2,0.2, 0.5, 0.2))
for (i in which(colnames(contra) %in% mainterms)) {
  beanplot(brute[, i] ~ datasrc, what = c(1,1,1,0), col = colist,
           main = paste("Std Wts, Component", i))
}
Figure 10 : Distributions of standardized weights by data set and component.

Figure 10 : Distributions of standardized weights by data set and component.

par(opar)
rm(opar)

We also include plots of the histograms of distributions by component (Figure 11). None of these really looks quite normal; almost all have some slightly odd shape.

opar <- par(mfrow = c(3, 3), mai = c(0.2,0.2, 0.5, 0.2))
for (i in which(colnames(contra) %in% mainterms)) {
  hist(brute[,i], breaks = 77, main = paste("Component", i))
}
Figure 11 : Histogram of standardized weights by component.

Figure 11 : Histogram of standardized weights by component.

par(opar)
rm(opar)

Data Set Sources of Top Twenty Lists

Next, we want to see how many items in the lists of “top twenty” contributers to each component come from each data set. The results are shown in Figure 12. Using the raw weights, the vast majority of contributions come from the clinical binary data, with secondary contributions from the MAF and RPPA data sets (as we expected from the above distribution plots). After standardization, most of the contributions arise from miRs, but the methylation, mRNA and, to a lesser extent, the MAF and RPPA data sets also are present.

top20 <- apply(contra, 2, function(X) {
  A <- abs(X)
  S <- rev(sort(A))
  which(A > S[21])
})
top20types <- apply(top20, 2, function(X) {
  table(datasrc[X])
})

chop20 <- apply(brute, 2, function(X) {
  A <- abs(X)
  S <- rev(sort(A))
  which(A > S[21])
})
chop20types <- apply(chop20, 2, function(X) {
  table(datasrc[X])
})



opar <- par(mfrow = c(1, 2))
image(1:7, 1:24, top20types, main = "Raw Weights", ylab = "Components", xlab = "Data Sets")
mtext(levels(datasrc), side = 3, at = 1:7, line = 0)
image(1:7, 1:24, chop20types, main = "Standardized Weights", ylab = "Components", xlab = "Data Sets")
mtext(levels(datasrc), side = 3, at = 1:7, line = 0)
Figure 12 : Number of contributers to top twenty lists.

Figure 12 : Number of contributers to top twenty lists.

par(opar)
rm(opar)

Overall Distribution of Weights

In Figure 13, we pool all the weights (across all data sets and all components) to look at histograms of the distributions. We also overlay the theoretical normal distribution that one would expect to see. Using the usual mean and distribution (right panel), the actual weights are slightly more conservative (i.e., concentrated near zero) than expected. This figure suggests that we might want to use standardization, and decide on significance purely from the theoretical normal distribution rather than from deviations away from that distribution.

opar <- par(mfrow = c(1, 2))
hist(contra, breaks = 123, main = "Raw Weights", prob = TRUE)
xx <- seq(-10, 10, length=1001)
yy <- dnorm(xx, mean(contra), sd(contra))
lines(xx, yy, col = "red", lwd=2)
hist(brute, breaks = 123, main = "Standardized Weights", prob = TRUE)
yy <- dnorm(xx)
lines(xx, yy, col = "red", lwd=2)
Figure 13 : Histograms of all weights (combined).

Figure 13 : Histograms of all weights (combined).

par(opar)
rm(opar)

Number of Contributers Selected by Normal Significance

Finally, we create yet another image (Figure 14), counting the number of significant features from each data set for each component, when using a significance cutoff of 5% derived from the standard normal distribution. We feel that this result is more reasonable than any thing we got just by looking at the top 20 lists. Most contributions come from the biggest omics data sets (mRNA, methylation, and miR) with fewer from MAF, RPPA, and clinical binary.

Q <- qnorm(0.975) # two-sided 5% cutoff
top1p <- aggregate(brute, list(datasrc), function(X) sum(abs(X) > Q)) # top 5 percent
rownames(top1p) <- top1p[, 1]
top1p <- as.matrix(top1p[, -1])
top1p <- top1p[, mainterms]
image(1:7, 1:9, top1p, ylab = "Components", xlab = "Data Sets")
mtext(levels(datasrc),side = 3, at = 1:7, line = 0)
Figure 14 : Number of significant contributions by data set and component.

Figure 14 : Number of significant contributions by data set and component.

Interpeting the MAF1 Component

In the final Cox proportional hazards model of overall survival, the component with the largest hazard ratio was “MAF1” (the first feature discovered from the MAF mutation data set), for which each one unit increase in the component corresponds to a 9-fold increase in the hazard. Our next goal is to find a biological interpretation of this component. First, here is an overview of all the contributing features.

pickA <- interpret(CW, "MAF1", 0.05)
summary(pickA)
##    Feature                   Source        Weight      
##  Length:286         ClinicalBin :  3   Min.   :-5.058  
##  Class :character   ClinicalCont:  1   1st Qu.:-2.599  
##  Mode  :character   MAF         :  6   Median :-2.178  
##                     Meth450     : 83   Mean   :-1.216  
##                     miRSeq      : 46   3rd Qu.: 1.961  
##                     mRNASeq     :131   Max.   : 7.607  
##                     RPPA        : 16

We start by looking more closely at the clinical features.

pickA[1:4, ]
##                                   Feature       Source    Weight
## gender.female               gender.female  ClinicalBin -2.153483
## gender.male                   gender.male  ClinicalBin  2.113973
## pathology_T_stage.t2 pathology_T_stage.t2  ClinicalBin -2.170436
## weight                             weight ClinicalCont  2.010542

Because of our decision to use one-hot encoding of categorical variables, our data set includes separate features for “male” and”female”. Both terms are strongly related to the MAF1 component, but (as one would hope) with approximately equal standardized weights but opposite signs. Being female decreases the hazard; being male increases it. Since the coefficient of MAF1 in the final model of overall survival is itself positive, we can infer the direction that the hazard changes. It is harder to “eyeball” the magnitude of the change in the hazard, since these coefficients only measure the relative contribution of these factors to the MAF1 component.

Many of the feature names obtained from TCGA include extra annotations that we won’t use later, so we are going to simplify them.

This code gets out all of the illumina genes and not just the first element of the list

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(tidyr)
illumina<- read_csv("https://webdata.illumina.com/downloads/productfiles/humanmethylation450/humanmethylation450_15017482_v1-2.csv", skip=7) 
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 486428 Columns: 33
## ── Column specification ────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (25): IlmnID, Name, AddressA_ID, AlleleA_ProbeSeq, AlleleB_ProbeSeq, Infinium_Design_Typ...
## dbl  (4): AddressB_ID, Genome_Build, MAPINFO, Coordinate_36
## lgl  (4): Random_Loci, Methyl27_Loci, Enhancer, DHS
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
illuminaselect <- illumina %>% select( c("IlmnID", "UCSC_RefGene_Name"))
illuminaselect2<- separate_rows(illuminaselect, UCSC_RefGene_Name, sep=";")
illuminaselect2 <- as.data.frame(illuminaselect2)
illuminaselect3<- illuminaselect2 [which(illuminaselect2$IlmnID %in% pickA$Feature),]
illuminaselect4<- illuminaselect3[match(pickA$Feature[ pickA$Source=="Meth450"], illuminaselect3$IlmnID),]
illuminaselect4$UCSC_RefGene_Name[which(is.na(illuminaselect4$UCSC_RefGene_Name))] <- ""
rap <- pickA$Feature[pickA$Source == "RPPA"]
rap <- sapply(strsplit(rap, "\\."), function(x) x[1])
map <- pickA$Feature[pickA$Source == "MAF"]
nap <- pickA$Feature[pickA$Source == "mRNASeq"]
nap <- sapply(strsplit(nap, "\\."), function(x) x[1])
nap <- nap[-1]
nick <- rep("", nrow(pickA))
nick[pickA$Source == "RPPA"] <- rap
nick[pickA$Source == "MAF"] <- map
nick[pickA$Source == "mRNASeq"] <- c("", nap)
nick[pickA$Source=="Meth450"]<- illuminaselect4$UCSC_RefGene_Name
pickA$Nickname <- nick
#write.csv(pickA, file = "pickMAF1.csv")

Here are the mutated genes (from the MAF data set) associated with the component “MAF1”.

pickA[nick %in% map, ]
##         Feature Source    Weight Nickname
## CFAP54   CFAP54    MAF -2.045903   CFAP54
## DNAH9     DNAH9    MAF -2.061517    DNAH9
## DYNC2H1 DYNC2H1    MAF -1.994866  DYNC2H1
## FBN3       FBN3    MAF -2.029872     FBN3
## MYH13     MYH13    MAF -2.008811    MYH13
## PCDHA12 PCDHA12    MAF -1.995470  PCDHA12

Note that they all have negative coefficients, meaning that having these mutations decreases the effect of this component. Since the coefficient of MAF1 in the final model of overall survival is itself positive, that means that having any of these mutations decreases the hazard for that patient. Here are the Entrez Gene descriptions of the genes:

CFAP54 (Cilia And Flagella Associated Protein 54)
Predicted to be involved in cilium assembly; cilium movement involved in cell motility; and spermatogenesis. Predicted to act upstream of or within cerebrospinal fluid circulation; motile cilium assembly; and mucociliary clearance. Predicted to be located in axoneme.
DNAH9 (Dynein Axonemal Heavy Chain 9)
This gene encodes the heavy chain subunit of axonemal dynein, a large multi-subunit molecular motor. Axonemal dynein attaches to microtubules and hydrolyzes ATP to mediate the movement of cilia and flagella.
DYNC2H! (Dynein Cytoplasmic 2 Heavy Chain 1)
This gene encodes a large cytoplasmic dynein protein that is involved in retrograde transport in the cilium and has a role in intraflagellar transport, a process required for ciliary/flagellar assembly. Mutations in this gene cause a heterogeneous spectrum of conditions related to altered primary cilium function and often involve polydactyly, abnormal skeletogenesis, and polycystic kidneys.
FBN3 (Fibrillin 3)
This gene encodes a member of the fibrillin protein family. Fibrillins are extracellular matrix molecules that assemble into microfibrils in many connective tissues. This gene is most highly expressed in fetal tissues and its protein product is localized to extracellular microfibrils of developing skeletal elements, skin, lung, kidney, and skeletal muscle. This gene is potentially involved in Weill-Marchesani syndrome.
MYH13 (Myosin Heavy Chain 13)
Predicted to enable microfilament motor activity. Predicted to be involved in muscle contraction. Predicted to act upstream of or within cellular response to starvation. Located in extracellular exosome.
PCDHA12 (Protocadherin Alpha 12)
This gene is a member of the protocadherin alpha gene cluster, one of three related gene clusters tandemly linked on chromosome five that demonstrate an unusual genomic organization similar to that of B-cell and T-cell receptor gene clusters. The alpha gene cluster is composed of 15 cadherin superfamily genes related to the mouse CNR genes and consists of 13 highly similar and 2 more distantly related coding sequences. The tandem array of 15 N-terminal exons, or variable exons, are followed by downstream C-terminal exons, or constant exons, which are shared by all genes in the cluster. The large, uninterrupted N-terminal exons each encode six cadherin ectodomains while the C-terminal exons encode the cytoplasmic domain. These neural cadherin-like cell adhesion proteins are integral plasma membrane proteins that most likely play a critical role in the establishment and function of specific cell-cell connections in the brain.

These descriptions clearly indicate some common functional relationships between the mutated genes, including a role in cilia, flagella, and microfibrils.

We then extracted all gene names from the MAF, mRNASeq, and RPPA data sets and used thm to perform a gene enrichment (pathway) analysis using ToppGene [@chen2009]. Associated GeneOntology Biological Process categories included keratinization, epidermal and epithelial cell development, cell adhesion, intermediate filament organization, and wound healing. GeneOntology Cellular Components included cell-cell junctions, extracellular matrix, and intermediate filaments. Associated human phenotypes included hyperkeratosis (particularly follicular hyperkeratosis), epidermal thickening, and oral leukoplakia. Associated pathways included keratinization, gap junction assembly and trafficking, and both ErbB and mTOR signaling. Associated cytogenetic regions included 1q21-1q22, 18q12.1, and 12q12.13. Associated gene families included cadherins, kallikreins, keratins, and gap-junction proteins. Associated diseases included hyperkertosis, squamous cell carcinoma of the head and neck, intraepithelial neoplasia, endometrial carcinoma, basal-like breast carcinoma, and esophageal carcinoma.

Export ToppGene queries for each component

Look at the components that were picked out in the final model, in addition to MAF1.

print(mainterms)
## [1] "ClinicalBin1" "MAF1"         "Meth4501"     "Meth4502"     "Meth4503"     "Meth4504"    
## [7] "mRNASeq1"     "RPPA2"        "RPPA3"
mylist<- list()

for (i in 1:length(mainterms)){

pickA <- interpret(CW, mainterms[i], 0.05)

rap <- pickA$Feature[pickA$Source == "RPPA"]
rap <- sapply(strsplit(rap, "\\."), function(x) x[1])
map <- pickA$Feature[pickA$Source == "MAF"]
nap <- pickA$Feature[pickA$Source == "mRNASeq"]
nap <- sapply(strsplit(nap, "\\."), function(x) x[1])
nap <- nap[-1]
nick <- rep("", nrow(pickA))
nick[pickA$Source == "RPPA"] <- rap
nick[pickA$Source == "MAF"] <- map
nick[pickA$Source == "mRNASeq"] <- c("", nap)

illuminaselect <- illumina %>% select( c("IlmnID", "UCSC_RefGene_Name"))
illuminaselect2<- separate_rows(illuminaselect, UCSC_RefGene_Name, sep=";")
illuminaselect2 <- as.data.frame(illuminaselect2)
illuminaselect3<- illuminaselect2 [which(illuminaselect2$IlmnID %in% pickA$Feature),]
illuminaselect4<- illuminaselect3[match(pickA$Feature[ pickA$Source=="Meth450"], illuminaselect3$IlmnID),]
illuminaselect4$UCSC_RefGene_Name[which(is.na(illuminaselect4$UCSC_RefGene_Name))] <- ""

nick[pickA$Source=="Meth450"]<- illuminaselect4$UCSC_RefGene_Name

pickA$Nickname <- nick  
  
  
mylist[[i]] <- pickA
}

names(mylist) <- mainterms
for (i in 1:length(mylist)){
  write.table(mylist[[i]]$Nickname, file=file.path("ToppGeneStuff", paste0("pick_", names(mylist)[i], ".txt")), quote = FALSE, 
              col.names =FALSE, row.names = FALSE)
}

Read in the .txt files containing results from ToppGene. Some interesting features to subset on are GO: Biological Process, Pathway, Coexpression, Coexpression Atlas, ToppCell Atlas, and Disease. These seem to be relevant to our interests and have a decent number of “Genes from Input”. Take the top 20 (based on lowest pvalues) of these enrichments and create a list of dataframes.

readinthese <- file.path("ToppGeneStuff", paste0("toppgene_", mainterms))
interested<- c("GO: Biological Process", "Pathway", "Coexpression", "Coexpression Atlas", "ToppCell Atlas", "Disease")
cutoffN <- 5

toppgeneres <- list()
for (i in 1:length(readinthese)){
  temp <- read.table(paste0(readinthese[i],".txt"), sep="\t", header=TRUE, fill=TRUE, quote="")
  temp2 <- temp[which(temp$Category %in% interested),]
  
  yeehaw <- list()
  
  for (k in 1:length(interested)){
    temp3 <-temp2[temp2$Category==interested[k],]
    temp4 <- head(temp3[order(temp3$p.value, decreasing = FALSE),], cutoffN)
    yeehaw[[k]] <- temp4
    yeet <- do.call(rbind, yeehaw)
    
  }
    toppgeneres[[i]] <- yeet
}

names(toppgeneres) <- mainterms
d<- do.call(rbind, toppgeneres) %>% 
  mutate( ComponentName =sapply(strsplit(rownames(.), "\\."),  `[[`, 1) ) %>% 
  select(ComponentName, everything()) %>%
  mutate(neglogpval = - log10(p.value))
library(ggplot2)
pdf(file=file.path("ToppGeneStuff", "toppgene_bubble.pdf"),width=20, height=20)
d %>% ggplot(aes(x=Name, y=ComponentName, size=as.numeric(Hit.Count.in.Query.List)))+
  geom_point(aes(color=neglogpval))+
  scale_colour_gradient(low = "pink", high="darkred")+
  scale_size(range=c(1,10), name="Hit Count in Query")+
  scale_x_discrete(guide=guide_axis(angle=90))+
  theme(panel.grid.major=element_line(colour="black"), 
        text=element_text(size=10))
dev.off()
## png 
##   2

Clustering the components for a neater visual:

uhh <- d[,c("ComponentName", "Name", "neglogpval")]

uhh[uhh==""]<-NA

uhh2 <- uhh %>% pivot_wider(values_from = neglogpval, names_from = Name, values_fn = mean) 

uhh2[is.na(uhh2)]<-0

h <- hclust(d=dist(uhh2))
## Warning in dist(uhh2): NAs introduced by coercion
ComponentNameNew<- uhh2$ComponentName[ order.dendrogram(as.dendrogram(h)) ]
myL <-list()

for (i in 1:length(ComponentNameNew)){
  
d_sub <- d[ComponentNameNew[i] == d$ComponentName,]

myL[[i]] <- d_sub  
}
d_new <- do.call(rbind, myL)
pdf(file=file.path("ToppGeneStuff", "toppgene_bubble_clustered.pdf"),width=20, height=22)
d_new %>% ggplot(aes(x=Name, y=ComponentName, size=as.numeric(Hit.Count.in.Query.List)))+
  geom_point(aes(color=neglogpval))+
  scale_colour_gradient(low = "pink", high="darkred")+
  scale_size(range=c(1,10), name="Hit Count in Query")+
  scale_x_discrete(guide=guide_axis(angle=90))+
  scale_y_discrete(limits=d_new$ComponentName)+
  theme(panel.grid.major=element_line(colour="black"), 
        text=element_text(size=10))
dev.off()
## png 
##   2

We see that components can be reasonably grouped together based on the top 5 items from the ontology categories we have chosen from ToppFunn (GO: Biological Process, Pathway, Coexpression, Coexpression Atlas, ToppCell Atlas, and Disease). In the largest group, we see that Meth4504, Meth4502, ClinicalBin1, mRNASeq1, Meth4503, and MAF1 are quite similar to each other. This group appears to be enriched for the term “Squamous cell”. The only thing separating mRNASeq1, Meth4503, and MAF1 are two myeloid-monocyte cell signatures and “Squamous cell carocinoma of the head and neck”. RPPA2 and RPPA3 are similar to each other and distinct from the group discussed previously in terms of enriched ontologies. Meth4501 also appears to be unique, and is notably the only component containing the term “adenocarcinoma” in its ToppFunn result.

Some unused code

Conclusions

We have identified a method analogous to that of MOFA that allows us to combine different omics data without the need for prior imputation of missing values. A major difference is that while MOFA model learns “factors” that are composites of the variables in an unsupervised fashion, the plasma model learns “components” that are composites of the variables in an supervised fashion, using the outcomes “event” and “time-to-event” as response variables.

Although the factors from MOFA are defined such that the efirst factor, Factor 1, accounts for the greatest variance in the model, the factors’ may or may not be significantly associated with the outcome, and a post-hoc survival analysis would need to be done to assess this. It may be the case that some factors, although they are significantly associated with outcome, account for very small variance in the final MOFA model, which hinders interpretability. This was the case with the TCGA-ESCA dataset, in which, when 10 factors were learned from the MOFA model, only Factor 10 was significantly associated with survival, while accounting for [number] variance in the model [CITE SUPPLEMENTARY RESULTS?]. On the other hand, the components for plasma are created in a way that maximizes the covariance in the predictors and the response, and therefore these components will be automatically associated to some degree with the outcome. This could be advantageous in that dissecting the weights associated with the components would yield a list of variables from different omics datasets that contribute the most to defining the outcome, and any additional analyses could be refined by looking at these high-weighted variables most closely.

Appendix: MOFA2

suppressWarnings( library(MOFA2) )
## 
## Attaching package: 'MOFA2'
## The following object is masked from 'package:plasma':
## 
##     predict
## The following object is masked from 'package:stats':
## 
##     predict
library(ggplot2)
library(dplyr)
library(tidyr)

In the final MOFA model, we would prefer that no single variable explains the most variance in the model simply because they encompass a wide range of values. For example, the untransformed “days_to_birth” column in “ClinicalCont” has a particularly wide range.

range(assemble$ClinicalCont["days_to_birth",])
## [1] -32872 -10222

We will create a vector of names of datasets that are continuous (not binary) and scale them around 0 before running MOFA.

# scale the columns that are not binary
notbinarydata<-c("ClinicalCont","Meth450","miRSeq","mRNASeq","RPPA")
UseMeScaled<- lapply(MO@data[notbinarydata], function(x) scale(x))
UseMeScaled$ClinicalBin <- MO@data$ClinicalBin
UseMeScaled$MAF <- MO@data$MAF
oof <- create_mofa(UseMeScaled)
## Creating MOFA object from a list of matrices (features as rows, sample as columns)...
oof
## Untrained MOFA model with the following characteristics: 
##  Number of views: 7 
##  Views names: ClinicalCont Meth450 miRSeq mRNASeq RPPA ClinicalBin MAF 
##  Number of features (per view): 6 1454 926 2520 192 53 566 
##  Number of groups: 1 
##  Groups names: group1 
##  Number of samples (per group): 185 
## 
plot_data_overview(oof)

First, we need to output the options and alter them if necessary. There are three, called “data options” “model options” and “training options”.

dataoptions <- get_default_data_options(oof)
dataoptions
## $scale_views
## [1] FALSE
## 
## $scale_groups
## [1] FALSE
## 
## $center_groups
## [1] TRUE
## 
## $use_float32
## [1] FALSE
## 
## $views
## [1] "ClinicalCont" "Meth450"      "miRSeq"       "mRNASeq"      "RPPA"         "ClinicalBin" 
## [7] "MAF"         
## 
## $groups
## [1] "group1"
modeloptions <- MOFA2::get_default_model_options(oof)
modeloptions$num_factors <-10

modeloptions
## $likelihoods
## ClinicalCont      Meth450       miRSeq      mRNASeq         RPPA  ClinicalBin          MAF 
##   "gaussian"   "gaussian"   "gaussian"   "gaussian"   "gaussian"   "gaussian"   "gaussian" 
## 
## $num_factors
## [1] 10
## 
## $spikeslab_factors
## [1] FALSE
## 
## $spikeslab_weights
## [1] TRUE
## 
## $ard_factors
## [1] FALSE
## 
## $ard_weights
## [1] TRUE
modeloptions[["likelihoods"]][["ClinicalBin"]] <- "bernoulli"
modeloptions[["likelihoods"]][["MAF"]] <- "bernoulli"

modeloptions
## $likelihoods
## ClinicalCont      Meth450       miRSeq      mRNASeq         RPPA  ClinicalBin          MAF 
##   "gaussian"   "gaussian"   "gaussian"   "gaussian"   "gaussian"  "bernoulli"  "bernoulli" 
## 
## $num_factors
## [1] 10
## 
## $spikeslab_factors
## [1] FALSE
## 
## $spikeslab_weights
## [1] TRUE
## 
## $ard_factors
## [1] FALSE
## 
## $ard_weights
## [1] TRUE
trainingoptions <- get_default_training_options(oof)
trainingoptions
## $maxiter
## [1] 1000
## 
## $convergence_mode
## [1] "fast"
## 
## $drop_factor_threshold
## [1] -1
## 
## $verbose
## [1] FALSE
## 
## $startELBO
## [1] 1
## 
## $freqELBO
## [1] 5
## 
## $stochastic
## [1] FALSE
## 
## $gpu_mode
## [1] FALSE
## 
## $seed
## [1] 42
## 
## $outfile
## NULL
## 
## $weight_views
## [1] FALSE
## 
## $save_interrupted
## [1] FALSE
oof <- prepare_mofa(oof, data_options = dataoptions, model_options = modeloptions, training_options = trainingoptions)
## Warning in prepare_mofa(oof, data_options = dataoptions, model_options = modeloptions, : Some
## view(s) have less than 15 features, MOFA will have little power to to learn meaningful factors
## for these view(s)....
## Checking data options...
## Checking training options...
## Checking model options...
# run MOFA if the file does not exist
if (file.exists("mofaESCAScaled.Rda")){
load("mofaESCAScaled.Rda") 
} else{
starttime<-Sys.time()
mofaESCAScaled <- run_mofa(oof, use_basilisk = TRUE)
save(mofaESCAScaled, file = "mofaESCAScaled.Rda")
endtime<-Sys.time()
endtime-starttime  
}

Looking at the factors:

plot_factor_cor

# How do the factors correlate to each other? 

plot_factor_cor(mofaESCAScaled)

# Variance explained: 

plot_variance_explained(mofaESCAScaled, plot_total = T)[[2]]

plot_variance_explained(mofaESCAScaled)

Heatmaps for the weight:

size <- sapply(UseMeScaled, dim)
type <- sapply(UseMeScaled, function(M) {
  U <- unique(as.vector(as.matrix(M)))
  ifelse (length(U) >3, "gaussian", "bernoulli")
})
heatme <- function() {
  Y <-  lapply(names(UseMeScaled), function(V) {
    sz <- size[1, V] < 200
    MOFA2::plot_weights_heatmap(mofaESCAScaled, view = V, main = V, show_colnames = sz, silent = TRUE)
  })
  Z <- lapply(Y, function(x) x$gtable)
  W <- gridExtra::arrangeGrob(Z[[1]], Z[[2]], Z[[3]], Z[[4]],
                              Z[[5]], Z[[6]], Z[[7]], nrow = 3)
  W
}
X <- heatme()
plot(X)

Looking at each factor:

factorTopWeights <- function(J, mymofainput, mymofaoutput) {
  Z <-  lapply(names(mymofainput), function(V) {
    plot_top_weights(mymofaoutput, view = V, factor = J) + ggplot2::ggtitle(V)
  })
  W <- gridExtra::grid.arrange(Z[[1]], Z[[2]], Z[[3]], Z[[4]],
                               Z[[5]], Z[[6]], Z[[7]], 
                               nrow = 3)
}

factorTopWeights(J=1, mymofainput = UseMeScaled, mymofaoutput = mofaESCAScaled)

factorTopWeights(J=2, mymofainput = UseMeScaled, mymofaoutput = mofaESCAScaled)

factorTopWeights(J=3, mymofainput = UseMeScaled, mymofaoutput = mofaESCAScaled)

factorTopWeights(J=4, mymofainput = UseMeScaled, mymofaoutput = mofaESCAScaled)

factorTopWeights(J=5, mymofainput = UseMeScaled, mymofaoutput = mofaESCAScaled)

factorTopWeights(J=6, mymofainput = UseMeScaled, mymofaoutput = mofaESCAScaled)

factorTopWeights(J=7, mymofainput = UseMeScaled, mymofaoutput = mofaESCAScaled)

factorTopWeights(J=8, mymofainput = UseMeScaled, mymofaoutput = mofaESCAScaled)

factorTopWeights(J=9, mymofainput = UseMeScaled, mymofaoutput = mofaESCAScaled)

factorTopWeights(J=10, mymofainput = UseMeScaled, mymofaoutput = mofaESCAScaled)

Relationship to Overall Survival

LF <- get_factors(mofaESCAScaled, groups = "all", factors = "all") [[1]]

all(rownames(MO@outcome) == rownames(LF))
## [1] TRUE
vitalstatus2 <- ifelse(MO@outcome$vital_status=="alive", 0,
                       ifelse(MO@outcome$vital_status=="dead", 1, NA))
LFwithOutcome<- data.frame(LF, event =  vitalstatus2, time_to_event = MO@outcome$Days) 
LF2 <- as.data.frame(t(LF))

OSresults <- matrix(NA, nrow=nrow(LF2), ncol = 4)
rownames(OSresults) <- rownames(LF2)
colnames(OSresults) <- c("coef", "HR", "score", "pvalue")


for (J in 1:nrow(LF2)) {
  tempdf <- data.frame(Time = LFwithOutcome$time_to_event,
                       Event = LFwithOutcome$event,
                       X <- LF2[J,])
  model <- coxph(Surv(Time, Event) ~ unlist(X), data = tempdf)
  S <- summary(model)
  val <- c(S$coefficients[1:2], S$sctest[c(1,3)])
  OSresults[J,] <- val}

OSresults
##                 coef        HR      score     pvalue
## Factor1   0.04034621 1.0411712 0.36946415 0.54329650
## Factor2   0.07980337 1.0830741 0.81726352 0.36598147
## Factor3  -0.03396494 0.9666054 0.06875337 0.79316059
## Factor4  -0.06938098 0.9329712 0.46431681 0.49561339
## Factor5   0.03659047 1.0372681 0.08116115 0.77572998
## Factor6   0.07235519 1.0750371 0.29335907 0.58807596
## Factor7  -0.11561998 0.8908137 0.67336613 0.41188048
## Factor8   0.13690316 1.1467171 1.37975043 0.24014366
## Factor9   0.12882066 1.1374861 0.50075710 0.47916765
## Factor10 -0.30654878 0.7359826 4.96757135 0.02582689
osmodel <- survival::coxph(Surv(LFwithOutcome$time_to_event, LFwithOutcome$event) ~. , data = LFwithOutcome)
AIC <- stats::step(osmodel)

summary(osmodel)
summary(AIC)

Factor 10 is the only factor significantly associated with survival.

Looking at Hazards ratio - OS:

s <- summary(osmodel)
coef <- s[["coefficients"]]

df <- data.frame(
  factor = factor(rownames(coef), levels = rev(rownames(coef))),
  p      = coef[,"Pr(>|z|)"], 
  coef   = coef[,"exp(coef)"], 
  lower  = s[["conf.int"]][,"lower .95"], 
  higher = s[["conf.int"]][,"upper .95"]
)

ggplot(df, aes(x=factor, y=coef, ymin=lower, ymax=higher)) +
  geom_pointrange( col='#619CFF') + 
  coord_flip() +
  scale_x_discrete() + 
  labs(y="Hazard Ratio", x="") + 
  geom_hline(aes(yintercept=1), linetype="dotted") +
  theme_bw()

The only factor significantly associated with survival outcome is Factor 10, which is marginal in terms of the variance explained in the model:

library(tidyr)
v10<- plot_variance_explained(mofaESCAScaled)
v10wide<-pivot_wider(v10$data, names_from="factor", values_from="value")
v10wide<-v10wide %>% select(-c("group"))
v10wide<-as.data.frame(v10wide) 
rownames(v10wide) <- v10wide[,1]
v10wide<-v10wide[,-1]
v10wide_t<-t(v10wide)
sort(rowSums(v10wide_t), decreasing = TRUE)
##    Factor1    Factor2    Factor3    Factor4    Factor5    Factor6    Factor7    Factor8 
## 107.469140  14.494450  12.729408  12.519620   9.193192   6.284764   6.100425   5.899442 
##    Factor9   Factor10 
##   5.712097   4.136799

The variances don not total to 100, so they are not really like what we normally think of when we say \(R^2\). Possibly it would be easier to interpret if we force them into percentages like this:

barplot(sort( rowSums(v10wide_t)/sum(rowSums(v10wide_t)) , decreasing = TRUE ), las=2, 
        main="Relative R^2")

sort( rowSums(v10wide_t)/sum(rowSums(v10wide_t)) , decreasing = TRUE )
##    Factor1    Factor2    Factor3    Factor4    Factor5    Factor6    Factor7    Factor8 
## 0.58236440 0.07854396 0.06897937 0.06784255 0.04981698 0.03405650 0.03305759 0.03196848 
##    Factor9   Factor10 
## 0.03095328 0.02241689

Some tentative conclusions from MOFA

  • This may highlight a crucial problem in MOFA, in which Factors that have a large variance explained in the model (usually the first few factors) may not be significantly related to survival outcome when doing the post-hoc analysis.

  • This is because these factors were learned with respect to explaining the most variance within the model, without any supervision. There is some “luck” associated with finding Factors that contribute to defining the model and are also significantly related to outcome.

  • Plasma solves this problem of the Factors becoming dissociated with outcome, since plasma is a supervised analysis that purposefully learns components that explain the outcome.